# Install pacman if not installed already, and activate it
if (!require("pacman")) install.packages("pacman")
# Install, update and activate libraries
pacman::p_load(
here,
rio,
skimr,
tsibble,
ISOweek,
slider, # for rolling means
gtsummary,
imputeTS,
pander,
tidyverse
)
# Create tsa_theme
tsa_theme <- theme_bw() +
theme(
plot.title = element_text(face = "bold",
size = 12),
legend.background = element_rect(fill = "white",
size = 4,
colour = "white"),
# legend.justification = c(0, 1),
legend.position = "bottom",
panel.grid.minor = element_blank()
)
Practical Session 4: Smoothing and Trends
Mortality Surveillance data in Spain
The Spanish daily mortality monitoring system is run by the National Centre for Epidemiology in Madrid. It gathers data from a stable number of municipalities around the country with a computerised death register and is representative of the population of Spain. The system was developed in 2004 with the objective of identifying exceedances in mortality during the summer period; data since 2000 were collected retrospectively.
Sessions 4-7 and 11 use data from this mortality surveillance system. Session 10 uses data from the same system, but for only one autonomous community in Spain (Aragón).
Expected learning outcomes
By the end of the session, participants should be able to:
Describe, test and fit a trend in surveillance data (simple smoothing and regression);
Assess and interpret the significance of trend in surveillance data.
You are provided with a dataset in R (mortagg.Rdata) which includes variables on week, year, total number of deaths (cases), and population, as well as number of deaths and population among males and females separately.
Task 4.1
Assess visually how the total registered number of deaths has been behaving in the years with available data. Save any changes to a new dataset named `mortagg`.
Discuss your results with your peers.
How would you proceed in analysing your data to further understand how mortality behaves?
Task 4.1.1 (Optional)
Would you reach the same conclusions if you were exploring mortality for males and females separately? Discuss with your peers.
Moving average and direct statistical modelling of trends
Moving averages are simple methods to visualise the general trend of a series after removing some of the random day-to-day variation by smoothing the data. This allows you to browse your data for periodicity and observe the general trend. In other words, smoothing the data may remove “noise” from your time series and can facilitate visual interpretation.
Moving averages model a time series by calculating the numerical mean of the values adjacent to it. It is calculated for each observation, moving along the time axis: e.g. at each time \(t\) and for a window of 5 time units, one way of calculating the moving average is by using the observations at \(t-2\), \(t-1\), \(t\), \(t+1\) and \(t+2\).
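As a minimal base-R sketch of this definition (toy numbers, no packages), the centred 5-point moving average at each position is the mean of the value itself and its two neighbours on either side:

```r
x <- c(10, 12, 11, 14, 13, 15, 16)

# Centred 5-point moving average: mean of x[t-2] ... x[t+2].
# The first and last two positions have incomplete windows, so they stay NA.
ma5 <- rep(NA_real_, length(x))
for (t in 3:(length(x) - 2)) {
  ma5[t] <- mean(x[(t - 2):(t + 2)])
}
ma5
# -> NA NA 12.0 13.0 13.8 NA NA
```

In practice you would use a dedicated rolling-window function (as shown later in this session) rather than a loop, but the arithmetic is exactly this.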
Regression
Using regression methods against the time variable is a simple and familiar way to look at trends and test the slope with the Wald test provided in the output. The kind of regression and interpretation you will use depends on the nature of your dependent variable – in this case, the number (count) of registered deaths.
Task 4.2
Assess visually the existence of a long-term trend and periodicity in the `mortagg` dataset by using smoothing techniques. Where would you centre the smoothing window and why? What is the effect of using different smoothing window centring options?
Task 4.3
Assess statistically whether the registered number of deaths has been stable in the years available in your dataset. How well can you model the registered number of deaths by assessing only the long-term trend?
Task 4.3.1 (Optional)
How could you take the – potentially changing – population of Spain into account in your analyses? Data on population are provided as variables in mortagg$pop.
Task 4.3.2 (Optional)
If you are confident with addressing the tasks above and still have enough time, follow the same procedure with the dis1 dataset. Which regression models may be relevant in this example?
Help for Task 4.1
Import the mortagg dataset.
mortagg <- import(here("data", "mortagg.csv"))
Inspect the data.
str(mortagg)
'data.frame': 520 obs. of 6 variables:
$ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
$ week : int 1 2 3 4 5 6 7 8 9 10 ...
$ cases : int 6090 6398 6271 6023 5552 5065 4821 4732 4509 4496 ...
$ cases_m: int 3196 3291 3263 3130 2860 2719 2555 2507 2384 2407 ...
$ cases_f: int 2894 3107 3008 2893 2692 2346 2266 2225 2125 2089 ...
$ pop : int 39953520 39953520 39953520 39953520 39953520 39953520 39953520 39953520 39953520 39953520 ...
summary(mortagg)
 year week cases cases_m cases_f
Min. :2000 Min. : 1.00 Min. :3748 Min. :1988 Min. :1696
1st Qu.:2002 1st Qu.:13.75 1st Qu.:4283 1st Qu.:2273 1st Qu.:1997
Median :2004 Median :26.50 Median :4516 Median :2389 Median :2137
Mean :2004 Mean :26.50 Mean :4657 Mean :2458 Mean :2198
3rd Qu.:2007 3rd Qu.:39.25 3rd Qu.:4915 3rd Qu.:2591 3rd Qu.:2328
Max. :2009 Max. :52.00 Max. :7343 Max. :3721 Max. :3622
pop
Min. :39953520
1st Qu.:41423520
Median :43260892
Mean :43273082
3rd Qu.:45236004
Max. :46367550
view(mortagg)
skimr::skim(mortagg)
| Name | mortagg |
| Number of rows | 520 |
| Number of columns | 6 |
| _______________________ | |
| Column type frequency: | |
| numeric | 6 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| year | 0 | 1 | 2004.50 | 2.88 | 2000 | 2002.00 | 2004.5 | 2007.00 | 2009 | ▇▇▇▇▇ |
| week | 0 | 1 | 26.50 | 15.02 | 1 | 13.75 | 26.5 | 39.25 | 52 | ▇▇▇▇▇ |
| cases | 0 | 1 | 4656.57 | 555.52 | 3748 | 4283.00 | 4516.5 | 4915.00 | 7343 | ▇▇▂▁▁ |
| cases_m | 0 | 1 | 2458.36 | 270.83 | 1988 | 2272.75 | 2389.0 | 2591.25 | 3721 | ▇▇▃▁▁ |
| cases_f | 0 | 1 | 2198.22 | 291.34 | 1696 | 1997.00 | 2137.0 | 2328.00 | 3622 | ▇▇▂▁▁ |
| pop | 0 | 1 | 43273082.00 | 2112930.23 | 39953520 | 41423520.00 | 43260892.5 | 45236004.00 | 46367550 | ▅▅▅▂▇ |
Create a time series object with tsibble, using the total case counts and plot it.
mortz <-
mortagg %>%
mutate(date_index = make_yearweek(year = year, week = week)) %>%
as_tsibble(index = date_index)
ggplot(data = mortz) +
geom_line(mapping = aes(x = date_index, y = cases)) +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
labs(x = "Date", y = "Number of Deaths", title = "Mortality") +
tsa_theme
## Adjusting y-axis scale
ggplot(data = mortz) +
geom_line(mapping = aes(x = date_index, y = cases)) +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
scale_y_continuous(limits = c(2000, NA)) +
labs(x = "Date", y = "Number of Deaths", title = "Mortality") +
tsa_theme
Help for Task 4.1.1
## Men and Women
ggplot(data = mortz) +
geom_line(mapping = aes(x = date_index, y = cases_m),
colour = "green",
lwd=1.1) +
geom_line(mapping = aes(x = date_index, y = cases_f),
colour = "black",
lwd=1.1) +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Week", y = "Number of Deaths", title = "Spain: Number of deaths by sex") +
tsa_theme
Help for Task 4.2
According to the slider package description, slider is a package for rolling window analysis: a given function is repeatedly applied to different “windows” of your data as you step through it. Typical applications of rolling window functions include moving averages and cumulative sums.
The slider package contains a family of functions specifically designed to compute a given operation on a rolling window. An introductory vignette for slider can be found by running the following command: vignette("slider")
For each record, create the following types of moving average:
MA5a: the 5-week moving average centred on cases.
MA5b: the 5-week moving average of cases and the 4 previous weeks.
MA5c: the 5-week moving average of the 5 previous weeks.
Compare results.
mortzma <-
mortz %>%
mutate(
MA5a = slide_index_dbl(
.x = cases,
.i = date_index,
.f = mean,
na.rm = TRUE,
.before = 2,
.after = 2,
.complete = TRUE
),
MA5b = slide_index_dbl(
.x = cases,
.i = date_index,
.f = mean,
na.rm = TRUE,
.before = 4,
.complete = TRUE
),
MA5c = slide_index_dbl(
.x = cases,
.i = date_index,
.f = function(x) mean(x[-6], na.rm = TRUE),
.before = 5,
.complete = TRUE
)
)
# view first 10 lines of data
head(mortzma, 10)
# A tsibble: 10 x 10 [1W]
year week cases cases_m cases_f pop date_index MA5a MA5b MA5c
<int> <int> <int> <int> <int> <int> <week> <dbl> <dbl> <dbl>
1 2000 1 6090 3196 2894 39953520 2000 W01 NA NA NA
2 2000 2 6398 3291 3107 39953520 2000 W02 NA NA NA
3 2000 3 6271 3263 3008 39953520 2000 W03 6067. NA NA
4 2000 4 6023 3130 2893 39953520 2000 W04 5862. NA NA
5 2000 5 5552 2860 2692 39953520 2000 W05 5546. 6067. NA
6 2000 6 5065 2719 2346 39953520 2000 W06 5239. 5862. 6067.
7 2000 7 4821 2555 2266 39953520 2000 W07 4936. 5546. 5862.
8 2000 8 4732 2507 2225 39953520 2000 W08 4725. 5239. 5546.
9 2000 9 4509 2384 2125 39953520 2000 W09 4571 4936. 5239.
10 2000 10 4496 2407 2089 39953520 2000 W10 4465 4725. 4936.
slide_index_dbl is a slider package function designed to run a pre-specified function on a rolling window of a numeric (_dbl) variable, ordered by an index (date or date-time) variable. A full description of this function can be found by running the following command: ?slide_index_dbl
The .x argument provides the vector with the numbers that will be used for computation.
The .i argument defines the vector with the time variable that will be used for ordering the data.
The .f argument indicates the function that will be used to perform the intended computation; in this case, an arithmetic mean is computed with the mean function. na.rm = TRUE is a varying argument passed through the ... argument of slide_index_dbl. (Run args("slide_index_dbl") and take note of the position/sequence of this function's arguments.)
The .before argument defines how many observations before the current time point are included in the rolling window; the .after argument does the same for observations after it. For the 5-point centred moving average MA5a above, 2 observations are used before the central point and 2 after it.
The .complete argument indicates whether the computation should be carried out only in rolling windows with complete observations.
We applied a more complicated function to compute MA5c, taking a backwards moving window of six values but omitting the latest value (current time point) from the calculation of the average.
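These arguments are easiest to see on a toy vector using slide_dbl, the simpler non-indexed variant from the same package (a sketch, not part of the exercise data):

```r
library(slider)

x <- c(10, 12, 11, 14, 13)

# Centred window: 2 before + current + 2 after; with .complete = TRUE,
# windows that run off either end of the series return NA
slide_dbl(x, mean, .before = 2, .after = 2, .complete = TRUE)
# -> NA NA 12 NA NA

# Backwards window of the current value and the 4 before it
slide_dbl(x, mean, .before = 4, .complete = TRUE)
# -> NA NA NA NA 12
```

Note how the same mean (12) appears at the centre of the series in the first call and at its end in the second: this is exactly the alignment difference between MA5a and MA5b.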
## wide format dataset ---
ggplot(data = mortzma, mapping = aes(x = date_index)) +
geom_line(mapping = aes(y = cases), colour = "black") +
geom_line(mapping = aes(y = MA5a), colour = "red") +
geom_line(mapping = aes(y = MA5b), colour = "blue") +
geom_line(mapping = aes(y = MA5c), colour = "green") +
scale_x_yearweek(date_labels = "%Y-%W",
date_breaks = "1 year") +
labs(x = "Date",
y = "Number of Deaths",
title = "Moving averages") +
tsa_theme
We observe that the calculation is similar across these various methods, but it is not aligned to the series in the same way. MA5a is centred in the middle of the period used to calculate the mean. MA5b is placed at the end of the period. MA5c is placed one step forward (smoothing functions can be used for forecasting the following point). The “models” provided are similar for a 5-week window, but the lag is different.
Moving average is only one way of smoothing. Other ways of smoothing the data to get a general idea of the trend include, for example, LOESS (locally estimated scatterplot smoothing) smoothing, where the contribution of surrounding observations is weighted, i.e. it is not the arithmetical mean for each set (window) of observations.
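For intuition, the same kind of LOESS fit can be produced directly with base R's stats::loess (a toy sketch on simulated data; geom_smooth uses loess behind the scenes for smaller datasets):

```r
# Toy sketch: LOESS smoothing of a noisy sine wave with stats::loess()
set.seed(42)
x <- 1:200
y <- sin(x / 20) + rnorm(200, sd = 0.3)

fit <- loess(y ~ x, span = 0.3)  # smaller span -> wigglier curve
smoothed <- predict(fit)

# One fitted (smoothed) value per observation
length(smoothed)
# -> 200
```

Plotting `smoothed` against `x` would show the weighted local fit tracking the sine wave while ignoring most of the noise.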
ggplot(mortz, mapping = aes(x = date_index, y = cases)) +
geom_point(alpha = 0.5) +
geom_smooth(se = TRUE) +
scale_x_yearweek(date_labels = "%Y-%W",
date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Date",
y = "Number of Deaths",
title = "Loess smoothing") +
tsa_theme
geom_smooth provides a smoothing curve, by default using the LOESS method for smaller datasets. We can change the degree of smoothing by adding a span argument: smaller values produce wigglier lines, and larger values produce smoother lines.
ggplot(data = mortz, mapping = aes(x = date_index, y = cases)) +
geom_point(alpha = 0.5) +
geom_smooth(se = TRUE, span = 0.1) +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Date", y = "Number of Deaths", title = "Loess smoothing (span = 0.1)") +
tsa_theme
To better observe the general trend, we need to find the length of the moving average that will erase the seasonal component. Various lengths can be tried; here we have used 25, 51 and 103 weeks.
mortzma2 <-
mortz %>%
mutate(
MA25 = slide_index_dbl(
.x = cases,
.i = date_index,
.f = mean,
na.rm = TRUE,
.before = 24,
.complete = TRUE
),
MA51 = slide_index_dbl(
.x = cases,
.i = date_index,
.f = mean,
na.rm = TRUE,
.before = 50,
.complete = TRUE
),
MA103 = slide_index_dbl(
.x = cases,
.i = date_index,
.f = mean,
na.rm = TRUE,
.before = 102,
.complete = TRUE
)
)
## Visually inspect data
slice(mortzma2, 20:30)
# A tsibble: 11 x 10 [1W]
year week cases cases_m cases_f pop date_index MA25 MA51 MA103
<int> <int> <int> <int> <int> <int> <week> <dbl> <dbl> <dbl>
1 2000 20 4123 2207 1916 39953520 2000 W20 NA NA NA
2 2000 21 3916 2059 1857 39953520 2000 W21 NA NA NA
3 2000 22 4295 2273 2022 39953520 2000 W22 NA NA NA
4 2000 23 4050 2168 1882 39953520 2000 W23 NA NA NA
5 2000 24 4200 2223 1977 39953520 2000 W24 NA NA NA
6 2000 25 4018 2191 1827 39953520 2000 W25 4708. NA NA
7 2000 26 4104 2146 1958 39953520 2000 W26 4628. NA NA
8 2000 27 4041 2155 1886 39953520 2000 W27 4534. NA NA
9 2000 28 3896 2096 1800 39953520 2000 W28 4439. NA NA
10 2000 29 4021 2142 1879 39953520 2000 W29 4359. NA NA
11 2000 30 3908 2088 1820 39953520 2000 W30 4293. NA NA
slice(mortzma2, 45:55)
# A tsibble: 11 x 10 [1W]
year week cases cases_m cases_f pop date_index MA25 MA51 MA103
<int> <int> <int> <int> <int> <int> <week> <dbl> <dbl> <dbl>
1 2000 45 4399 2383 2016 39953520 2000 W45 4037. NA NA
2 2000 46 4456 2446 2010 39953520 2000 W46 4058. NA NA
3 2000 47 4462 2376 2086 39953520 2000 W47 4065. NA NA
4 2000 48 4444 2326 2118 39953520 2000 W48 4081. NA NA
5 2000 49 4271 2315 1956 39953520 2000 W49 4084. NA NA
6 2000 50 4410 2378 2032 39953520 2000 W50 4099. NA NA
7 2000 51 4521 2414 2107 39953520 2000 W51 4116 4406. NA
8 2000 52 4732 2539 2193 39953520 2000 W52 4144. 4379. NA
9 2001 1 4897 2635 2262 40688520 2001 W01 4184. 4350. NA
10 2001 2 4749 2553 2196 40688520 2001 W02 4213. 4320 NA
11 2001 3 4668 2515 2153 40688520 2001 W03 4243. 4293. NA
slice(mortzma2, 95:105)
# A tsibble: 11 x 10 [1W]
year week cases cases_m cases_f pop date_index MA25 MA51 MA103
<int> <int> <int> <int> <int> <int> <week> <dbl> <dbl> <dbl>
1 2001 43 4263 2296 1967 40688520 2001 W43 4206. 4371. NA
2 2001 44 4223 2265 1958 40688520 2001 W44 4199. 4367. NA
3 2001 45 4286 2313 1973 40688520 2001 W45 4193. 4364. NA
4 2001 46 4717 2483 2234 40688520 2001 W46 4207. 4369. NA
5 2001 47 4642 2496 2146 40688520 2001 W47 4208. 4373. NA
6 2001 48 4816 2546 2270 40688520 2001 W48 4229. 4383. NA
7 2001 49 4723 2502 2221 40688520 2001 W49 4255 4390. NA
8 2001 50 4837 2580 2257 40688520 2001 W50 4279. 4396. NA
9 2001 51 5078 2753 2325 40688520 2001 W51 4303. 4403. 4407.
10 2001 52 5559 2876 2683 40688520 2001 W52 4353. 4416. 4402.
11 2002 1 5878 3020 2858 41423520 2002 W01 4423. 4438. 4397.
You can use View to examine mortzma2 interactively.
View(mortzma2)
ggplot(mortzma2, mapping = aes(x = date_index)) +
geom_line(mapping = aes(y = cases), colour = "black", lwd=0.7) +
geom_line(mapping = aes(y = MA25), colour = "red", lwd=1.1) +
geom_line(mapping = aes(y = MA51), colour = "blue", lwd=1.1) +
geom_line(mapping = aes(y = MA103), colour = "orange", lwd=1.1) +
scale_x_yearweek(date_labels = "%Y-%W",
date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Week",
y = "Number of deaths",
title = "Moving averages") +
tsa_theme
Comment on the lines produced by the different smoothing windows. Which one do you think is best for eliminating seasonality? What would happen if you used an even larger window?
Help for Task 4.3
Using regression against time is a very simple way to look at the trends and test the slope with the Wald test provided.
We will use the standard glm function to fit a Poisson regression model.
For the regression you will need to create a consecutive time index variable from the year and week.
mortz <-
mortz %>%
ungroup() %>%
mutate(index = seq.int(from = 1, to = nrow(.)))
mort_poissontrend <- glm(
formula = cases ~ index,
family = "poisson",
data = mortz
)
Check model results.
summary(mort_poissontrend)
Call:
glm(formula = cases ~ index, family = "poisson", data = mortz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 8.407e+00 1.300e-03 6469.40 <2e-16 ***
index 1.470e-04 4.282e-06 34.34 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 32912 on 519 degrees of freedom
Residual deviance: 31732 on 518 degrees of freedom
AIC: 37080
Number of Fisher Scoring iterations: 4
With family = "poisson", the glm function fits a log-linear model: the logarithm of the expected number of deaths is modelled as a linear function of time. You can use the names function to see the various components of the output of the glm function. As mentioned in the previous session, the str function is also often useful for looking at the “structure” of the output of an R function.
Variables in R models are specified using notation along the lines of response_variable ~ explanatory_variable_1 + explanatory_variable_2 etc. Check the material from the MVA module. A more detailed explanation of how to build model formulas in R can be found in the Details section of the help page of the formula function: ?stats::formula.
summary, when given the results of fitting a linear regression model, will provide a summary of the fitted trend and associated statistics.
Identify and interpret the intercept and the trend. In a Poisson model the coefficients are on the log scale, so each needs to be exponentiated in order to be interpreted on the original scale of the outcome.
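A small self-contained sketch of this interpretation (on simulated weekly counts, not the mortagg data): the exponentiated slope is the multiplicative change in the expected count per one-week step.

```r
# Simulated weekly counts with a true multiplicative trend of exp(1.5e-4)
set.seed(1)
t <- 1:520
y <- rpois(520, lambda = exp(8.4 + 1.5e-4 * t))

m <- glm(y ~ t, family = "poisson")
exp(coef(m)["t"])  # ~1.00015: expected counts grow by about 0.015% per week
```

A coefficient of 1.47e-4 in the model above has the same reading: exp(1.47e-4) ≈ 1.000147, a very small but statistically significant weekly increase.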
Plot the fitted values against the observed ones.
ggplot(data = mortz, mapping = aes(x = index)) +
geom_point(mapping = aes(y = cases), alpha = 0.5) +
geom_line(mapping = aes(y = fitted(mort_poissontrend)), colour = "green", lwd = 1.1) +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Week", y = "Number of deaths", title = "Number of deaths per week with fitted trend") +
tsa_theme
Help for Task 4.3.1
To take the population into account, we include it in the model as an offset, by adding a term offset(log(pop)). The coefficient of an offset is fixed at 1 rather than estimated; in other words, the offset itself is not fitted. The advantage is that the model now describes the mortality rate (cases/population), so the exponentiated coefficient of time can be interpreted as an incidence rate ratio (IRR).
mort_poissontrend_pop <- glm(cases ~ index + offset(log(pop)),
data = mortz,
family = "poisson"
)
summary(mort_poissontrend_pop)
Call:
glm(formula = cases ~ index + offset(log(pop)), family = "poisson",
data = mortz)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -9.090e+00 1.300e-03 -6990.58 <2e-16 ***
index -1.764e-04 4.285e-06 -41.17 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 32147 on 519 degrees of freedom
Residual deviance: 30452 on 518 degrees of freedom
AIC: 35801
Number of Fisher Scoring iterations: 4
The summary() function only provides the coefficients on the original log scale of the Poisson model. Using tbl_regression with the argument exponentiate = T, we get the IRR directly:
mort_poissontrend_pop %>%
gtsummary::tbl_regression(
exponentiate = T,
estimate_fun = ~style_number(.x, digits = 4)
)
| Characteristic | IRR | 95% CI | p-value |
|---|---|---|---|
| index | 0.9998 | 0.9998, 0.9998 | <0.001 |
| Abbreviations: CI = Confidence Interval, IRR = Incidence Rate Ratio | |||
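The same IRR (with a Wald confidence interval) can also be recovered by hand from any fitted model using exp(). A self-contained sketch on simulated data with an offset, mirroring the model above:

```r
# Simulated rate data: counts proportional to population, slowly declining rate
set.seed(2)
t   <- 1:520
pop <- round(seq(4.0e7, 4.6e7, length.out = 520))
y   <- rpois(520, lambda = pop * exp(-9.1 - 1.8e-4 * t))

m <- glm(y ~ t + offset(log(pop)), family = "poisson")

exp(coef(m)["t"])               # IRR per week, here just below 1
exp(confint.default(m)["t", ])  # Wald 95% CI on the IRR scale
```

An IRR just below 1 per week, as in the table above, corresponds to a small but steady decline in the mortality rate over time.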
And plot the predicted against the original values:
ggplot(data = mortz, mapping = aes(x = index)) +
geom_point(mapping = aes(y = cases), alpha = 0.5) +
geom_line(mapping = aes(y = fitted(mort_poissontrend_pop)), colour = "blue", lwd = 1.1) +
#scale_y_continuous(limits = c(0, NA)) +
labs(
x = "Time point", y = "Number of Deaths",
title = "Mortality with fitted trend (Poisson)"
) +
tsa_theme
Help for Task 4.3.2
You can use the salmonellosis example (dis1) to see how an exponential trend might fit the data better.
source("src/tidyverse/session_4.R")
For the regression you will still need to create a time index variable from the year and week (consecutive week number).
dis1 <- import(here("data", "tsa_practice.xlsx"), which = "dis1")
dis1 <-
dis1 %>%
mutate(date = seq.int(from = 1, to = nrow(.)))
Generate a new variable lcases as the natural logarithm of cases:
salmo <- dis1
salmo <-
salmo %>%
mutate(lcases = log(cases))
salmoz2 <-
salmo %>%
mutate(date_index = make_yearweek(year = year, week = week)) %>%
as_tsibble(index = date_index)
Plot the logarithm of the number of cases against time:
ggplot(data = salmoz2, mapping = aes(x = date_index)) +
geom_line(mapping = aes(y = lcases)) +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(x = "Year", y = "Cases (natural log scale)", title = "Log Salmonella data") +
tsa_theme
Fit a model of lcases against date using linear regression.
logmodel <- lm(lcases ~ date, data = salmoz2)
summary(logmodel)
Call:
lm(formula = lcases ~ date, data = salmoz2)
Residuals:
Min 1Q Median 3Q Max
-2.65732 -0.61829 0.08907 0.64319 2.60617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.0247934 0.0890767 11.51 <2e-16 ***
date 0.0099544 0.0003528 28.22 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.89 on 417 degrees of freedom
(12 observations deleted due to missingness)
Multiple R-squared: 0.6563, Adjusted R-squared: 0.6555
F-statistic: 796.2 on 1 and 417 DF, p-value: < 2.2e-16
Plot the log data and the model against time. Note that fitted values cannot be created where the number of cases is missing. The Stata version of this exercise appears to interpolate the fitted values linearly to fill in the gaps, so we will do the same.
fitted_pois_df <-
enframe(fitted(logmodel), name = "date", value = "ltrend") %>%
mutate(date = as.integer(date))
salmoz2 <-
salmoz2 %>%
left_join(
x = .,
y = fitted_pois_df,
by = "date"
) %>%
mutate(ltrend = imputeTS::na_interpolation(x = ltrend, option = "linear"))
We use left_join to slot the fitted values back into the time series by date, and then imputeTS::na_interpolation to linearly interpolate the remaining missing values.
ggplot(data = salmoz2, mapping = aes(x = date)) +
geom_line(mapping = aes(y = lcases)) +
geom_line(mapping = aes(y = ltrend), colour = "green") +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(
x = "Time point",
y = "Log Salmonella cases",
title = "Log Salmonella data with fitted trend"
) +
tsa_theme
Generate a new variable (trend), the anti-log of the prediction (ltrend):
salmoz2 <-
salmoz2 %>%
mutate(trend = exp(ltrend))
Plot the real data (cases) and this model (trend) against time:
ggplot(data = salmoz2, mapping = aes(x = date)) +
geom_line(mapping = aes(y = cases)) +
geom_line(mapping = aes(y = trend), colour = "green") +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(
x = "Time point",
y = "Salmonella cases",
title = "Salmonella data with fitted trend"
) +
tsa_themeCompare the results with those you would have got if you had used linear regression on the original data.
Alternatively, you can run a Poisson or a negative binomial regression to model the data:
salmopoismodel <- glm(cases ~ date, data = salmoz2, family = "poisson")
summary(salmopoismodel)
Call:
glm(formula = cases ~ date, family = "poisson", data = salmoz2)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.310e+00 2.483e-02 52.76 <2e-16 ***
date 1.000e-02 7.118e-05 140.55 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 40792 on 418 degrees of freedom
Residual deviance: 12557 on 417 degrees of freedom
(12 observations deleted due to missingness)
AIC: 14688
Number of Fisher Scoring iterations: 5
There is more on using Poisson regression later on.
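A negative binomial model (mentioned above but not shown) can be fitted with MASS::glm.nb. A hedged, self-contained sketch on simulated overdispersed counts — note that MASS is not in the p_load list at the top, so it must be loaded separately:

```r
# Sketch: negative binomial regression for overdispersed counts
# (variance > mean), using MASS::glm.nb on simulated data
library(MASS)

set.seed(4)
t <- 1:419
y <- rnbinom(419, mu = exp(1.3 + 0.01 * t), size = 5)

nbmodel <- glm.nb(y ~ t)
exp(coef(nbmodel)["t"])  # multiplicative trend per week, read as for Poisson
nbmodel$theta            # estimated dispersion parameter
```

If the residual deviance of the Poisson model is much larger than its degrees of freedom (as in the salmonellosis output above), the negative binomial model is usually the safer choice, since it widens the standard errors to reflect the extra variability.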
fitted_pois2_df <-
enframe(fitted(salmopoismodel), name = "date", value = "trend2") %>%
mutate(date = as.integer(date))
salmoz2 <-
salmoz2 %>%
left_join(
x = .,
y = fitted_pois2_df,
by = "date"
) %>%
mutate(trend2 = imputeTS::na_interpolation(x = trend2, option = "linear"))
ggplot(data = salmoz2, mapping = aes(x = date)) +
geom_line(mapping = aes(y = cases)) +
geom_line(mapping = aes(y = trend2), colour = "green") +
scale_x_yearweek(date_labels = "%Y-%W", date_breaks = "1 year") +
scale_y_continuous(limits = c(0, NA)) +
labs(
x = "Time point",
y = "Salmonella cases",
title = "Salmonella data with fitted trend"
) +
tsa_theme